Path Set Operations for Clipping of Parts of Web Pages and Information Extraction from Web Pages
Abstract
Extracting parts of Web pages is attractive for two purposes. One is to clip parts of Web pages, much as we clip articles from newspapers. The other is to let software use the information on Web pages. In this paper we define operations for extracting parts of Web pages, called path set operations. The operations serve both Web page clipping and information extraction from Web pages. Web page clipping is the extraction of parts of Web pages while preserving their view information. Information extraction from Web pages is the transformation of Web pages into tractable structures. We show that the operations make it easy to extract parts of Web pages for both purposes, and that procedures based on the operations are somewhat robust against updates to Web pages.
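The abstract does not give the formal definition of a path set, so the following is only an illustrative sketch under stated assumptions. It assumes a path is the tuple of child indexes leading from the document root to an element, and that a path set is an ordinary set of such tuples, so that set intersection and difference can be used to narrow a selection. The helper names (`all_paths`, `select`, `under`, `extract`, `clip`) and the sample page are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """<html><body>
<div><h1>News</h1><p>First story</p><p>Second story</p></div>
<div><p>Advert</p></div>
</body></html>"""


def all_paths(root):
    """Map each element's path (tuple of child indexes from the root) to the element."""
    paths = {(): root}
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        for i, child in enumerate(node):
            child_path = path + (i,)
            paths[child_path] = child
            stack.append((child_path, child))
    return paths


def select(paths, tag):
    """Path set of every element with the given tag (assumed selection rule)."""
    return {p for p, node in paths.items() if node.tag == tag}


def under(paths, prefix):
    """Path set of every element strictly below the element at `prefix`."""
    n = len(prefix)
    return {p for p in paths if len(p) > n and p[:n] == prefix}


def extract(paths, path_set):
    """Information extraction: text of each selected element, in document order."""
    return [paths[p].text for p in sorted(path_set)]


def clip(paths, path_set):
    """Clipping: markup of each selected element, so its presentation is kept."""
    return [ET.tostring(paths[p], encoding="unicode") for p in sorted(path_set)]


root = ET.fromstring(PAGE)
paths = all_paths(root)

paragraphs = select(paths, "p")      # every <p> on the page
first_div = under(paths, (0, 0))     # everything inside the first <div>
stories = paragraphs & first_div     # intersection of two path sets
others = paragraphs - first_div      # difference of two path sets

print(extract(paths, stories))
print(clip(paths, stories))
```

In this sketch the same path set feeds both uses named in the abstract: `extract` turns the selection into a plain list that software can process, while `clip` keeps the original markup of the selected parts. Index-based paths break when siblings are inserted before a target, so a procedure meant to survive page updates would need a more tolerant path representation than the one assumed here.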
Similar resources
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Investigating the relationship between information quality and external markers in Persian web pages related to public health
Introduction: One approach to evaluate the quality of a web page is to investigate its external markers. The purpose of the present study is to determine the relationship between information quality of Persian public health web pages and their external quality. Methods: The samples of this correlation study were selected from among the freely available ten-key word texts of chronic diseases...
A Technique for Improving Web Mining using Enhanced Genetic Algorithm
World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...
Evaluating the quality of web pages of research institutes affiliated with the Ministry of Science, Research and Technology located in Tehran from the users' point of view
Especially in research centers, evaluating the quality of web pages from clients' point of view has a constructive role in their design and development, since it makes the web developers familiar with client's perspective and assists them in designing client-oriented web sites in scientific and research environment. As a model for assessing the quality of web pages, "webQual" attempts to provid...